Deep Efficient Data Association for Multi-Object Tracking: Augmented with SSIM-Based Ambiguity Elimination

Recently, we harnessed the power of deep learning-based methods to address the multiple object tracking (MOT) problem. The tracking-by-detection approach to MOT involves two primary steps: object detection and data association. In the first step, objects of interest are detected in each frame of a video. The second step establishes the correspondence between these detected objects across different frames to track their trajectories. This paper proposes an efficient and unified data association method that utilizes a deep feature association network (deepFAN) to learn the associations. Additionally, the Structural Similarity Index Metric (SSIM) is employed to address uncertainties in the data association, complementing the deep feature association network. These combined association computations effectively link the current detections with the previous tracks, enhancing the overall tracking performance. To evaluate the efficiency of the proposed MOT framework, we conducted a comprehensive analysis on popular MOT datasets, such as the MOT Challenge and UA-DETRAC. The results showed that our technique performed substantially better than the current state-of-the-art methods in terms of standard MOT metrics.


Introduction
Vision-based multiple object tracking (MOT) is a longstanding research problem with broad applications in computer vision, such as intelligent surveillance systems, robotics, human-computer interaction, medical image processing, and autonomous driving. The MOT algorithm provides a robust framework for real-time monitoring and analysis of multiple moving objects, enabling accurate tracking and prediction of their movements in various dynamic scenarios. The tracking-by-detection paradigm is widely recognized as the most effective approach to MOT. It involves utilizing an efficient object-detection algorithm to identify objects within each frame of a video sequence. Subsequently, a data association algorithm is employed to establish associations between detections across frames, thereby creating object trajectories [1][2][3]. Although various approaches have been presented to handle the problem, MOT is still a challenging research area due to factors such as object occlusions, the varying number of objects per frame, and abrupt appearance changes.
The state-of-the-art MOT concepts have become more potent with recent advances in deep learning. The tracking-by-detection paradigm treats object detection and data association as two independent tasks. Modern advancements in deep learning have led to the development of highly effective off-the-shelf object detectors capable of accurately detecting various objects in complex scenes [4][5][6][7][8][9][10]. Once object detections are obtained in each frame, the subsequent task of data association focuses on linking these detections across consecutive frames to establish object trajectories over time. Data association remains a challenging task in its own right and has not fully leveraged the advancements in deep learning. The standard process of data association typically involves extracting representative features from individual detections and then matching them with existing object trajectories using specified similarity metrics. Deep learning networks offer enhanced capability for learning robust feature representations of objects. In our study, we employed a modified VGGNet deep network for hierarchical feature learning and extraction from all detected objects. This deep feature extractor allowed us to capture distinctions and variations in object appearances, improving the accuracy and reliability of object associations over time.
After extracting features for detections in each frame, the next task is to associate these detections with previously tracked objects. This association involves comparing the extracted features of detections with those of existing object trajectories to find the most suitable matches. Detections with the highest similarity scores are linked to their corresponding trajectories. Our research mainly focuses on enhancing the data association task within MOT frameworks. We propose an efficient association framework that integrates a deep feature association network (deepFAN) and the Structural Similarity Index Metric (SSIM) [11][12][13]. This framework jointly calculates association scores for detection-target pairs. The deepFAN learns the complex feature association function to encode the association score between detections and tracked targets, while the SSIM handles uncertainties in association by comparing the similarity of the candidate pairs. By combining the deep learning capabilities of the deep feature association network with the SSIM, our approach aims to improve the accuracy and reliability of object associations across frames.
Traditionally, affinities in MOT are calculated by exhaustively evaluating all possible pairs of detection and target features. In contrast, our proposed MOT framework integrates a neighborhood detection estimation (NDE) module to refine this process, selecting a more reliable subset of detection-target pairs. The NDE module enhances efficiency by focusing on nearby or contextually relevant detections rather than evaluating every possible permutation. This filtering step improves the quality of associations by prioritizing those with a higher likelihood of accuracy. In our framework, the deepFAN and the SSIM jointly determine the association score for these refined pairs. Furthermore, the training method we employed for the deepFAN enables efficient object association across multiple frames in a video sequence, ensuring reliable trajectory tracking. During training, the network is exposed to input frame pairs that are not necessarily consecutive. This strategy proves beneficial by allowing the framework to link objects across non-adjacent frames. This capability reduces instances of identity switches and fragmented object trajectories. By integrating the NDE module and optimizing deepFAN training, our MOT framework enhances tracking accuracy while maintaining computational efficiency.
This study presents a systematic approach to estimating an efficient association matrix for MOT, which effectively summarizes the correspondence between current frame detections and previously estimated target trajectories. Leveraging the capabilities of deep learning architectures, our proposed framework integrates innovative components aimed at enhancing MOT performance. Key components of our approach include the following:

• The proposed data association framework employs the deep feature association network (deepFAN) along with the Structural Similarity Index Metric (SSIM) to estimate an efficient association matrix. This combination improves the robustness of object associations by leveraging deep learning for feature extraction and similarity evaluation.

• In the proposed data association framework, a neighborhood-detection-estimation (NDE) scheme is introduced to select a reliable subset of detection-target pairs. This neighborhood detection estimation, along with post-processing steps within the deep feature association network, contributes to enhancing the computational efficiency. Experimental evaluations highlight that the proposed approach minimizes incorrect associations, thereby improving overall tracking performance.

• A specialized training strategy is developed for the deep feature association network (deepFAN), allowing the network to utilize non-consecutive frame pairs for the effective learning of the data association function. This method improves the overall ability of the network to link objects across frames, thereby reducing identity switches and fragmented trajectories.
We validated the effectiveness of each component through ablative experiments on the MOT validation dataset. Additionally, comprehensive analyses on the benchmark datasets, including MOT15, MOT17, MOT20, and UA-DETRAC, demonstrated that our method achieved competitive and state-of-the-art results across various MOT evaluation metrics. The MOT metric scores for identity switches, fragmentation, and false negatives were reduced, indicating a reduction in wrong associations among detection-target pairs.
The rest of the article is organized as follows: Section 2 reviews the existing literature on multiple object tracking (MOT). Section 3 details the methodology employed in the online multiple object tracking framework. In Section 4, we present the experimental findings and comparative results and discuss them in depth. Finally, Section 5 concludes the study with a summary of the findings and suggestions for future research directions.

Related Works
To obtain a comprehensive overview of the multiple object tracking (MOT) problem, we refer to foundational studies [14][15][16]. Within MOT frameworks, the tracking-by-detection approach is the most commonly utilized method [1][2][3]. The effectiveness of this approach relies heavily on the quality of object detections and the accuracy of trajectory estimation. The recent advancements in deep learning have significantly improved object detectors [4][5][6][7][8][9][10], leading to better object detection performance and, consequently, enhancing the overall efficiency of the MOT framework.
This discussion will focus specifically on data association approaches used for trajectory estimation in MOT. An essential step for any data association method is computing representative features of the detections in each frame. Several approaches exist for determining representation models, including appearance-based [17][18][19], motion-based [20][21][22], and composite models [23,24]. For MOT frameworks, deep learning-based feature extraction methods provide robust and discriminative representation models for object detections, which significantly boost tracking performance. Typically, pre-trained classification or object detection models are employed for feature extraction in tracking tasks [25][26][27][28][29]. In particular, ShiJie et al. [30] proposed a deep affinity network that jointly learns representational features and their affinities with targets. Our proposed MOT framework adopts the feature extraction model utilized in [30].
The study by Emami et al. [31] views data association as a multidimensional assignment problem and consolidates many popular learning algorithms employed for MOT data association. Researchers have explored various methodologies, including non-probabilistic algorithms, probabilistic graphical models, Markov Chain Monte Carlo (MCMC), and deep learning techniques to solve the data association problem. Among non-probabilistic approaches, the Greedy Randomized Adaptive Search Procedure (GRASP) is frequently used for multi-sensor multi-object tracking [32]. In probabilistic graphical models, common techniques include network optimization [33,34], conditional random fields [35][36][37], and belief propagation [38,39]. Additionally, MCMC has been a valuable tool for data association in multiple object tracking [40,41].
In recent years, there have been numerous successful attempts to formulate data association in MOT using deep learning methods. The Deep Affinity Network (DAN) proposed by ShiJie et al. [30] represents an end-to-end trainable deep network that jointly learns feature modeling and association estimation. Similarly, FAMNet [42] leverages deep networks for both feature extraction and association estimation. Yihong et al. introduced the Deep Hungarian Network (DHN) [43], which predicts associations from a cost matrix derived from detections and targets. The Dual-Matching Attention Network (DMAN) [44] employs spatial and temporal attention mechanisms to predict and refine association assignments. The integration of deep models such as Recurrent Neural Networks (RNNs) [45], autoencoders [46], and Generative Adversarial Networks (GANs) [47] into the data association problem has led to significant improvements in MOT performance.
This work presents a systematic approach to data association within the MOT framework that harnesses the power of deep learning models. The proposed MOT algorithm for track association enhances both computational efficiency and tracking accuracy. By leveraging the potential of deep learning models, our method aims to address the complexities and challenges associated with data association in MOT, ensuring more reliable and effective tracking outcomes.

Methodology: Online Multiple Object Tracking Framework
In the tracking-by-detection paradigm of MOT, the process involves two distinct modules: the object detector and the object tracker. The object detector initially identifies target objects by generating bounding boxes in each video frame. From these bounding boxes, we compute the center locations of objects, $C^f_D$, for frame $I_f$. Our proposed MOT framework is designed to seamlessly integrate with existing multi-object detection methods. We evaluated our approach across various online challenges in multiple object tracking, where state-of-the-art object detectors provide the initial object detections. Specifically, we utilized detections from prominent MOT challenges such as MOT15, MOT17, MOT20 [48,49], and UA-DETRAC [50,51]. Each challenge provides video sequences annotated with detections generated by specific detectors designated for the challenge.
The block diagram representation of the proposed MOT framework is shown in Figure 1. One of the significant components of the proposed MOT framework is a deep feature extractor using a modified VGGNet architecture. The architecture of the feature extractor employed in the proposed framework is based on the state-of-the-art MOT framework described in Reference [30]. The extractor is designed to efficiently extract comprehensive and compact features from the input object detections. The pretrained VGGNet architecture is fine-tuned within the context of multiple object tracking using training sequences of the MOT benchmark. As shown in Figure 1, the representative feature of each object is obtained by passing the current video frame $I_f$ and object centers $C^f_D$ through the deep feature extractor. For each object detection, a 520-dimensional feature vector is obtained. We refer to [30] for the architectural details of the modified VGGNet feature extractor. The input to the data association network is $\Phi$, and the output is the association matrix $A \in \mathbb{R}^{N_{TL} \times N_d}$; the scalar score $A_{(i,j)}$ represents the association score between the $j$-th detection and the $i$-th target. The trajectory list $T$ is updated using the association matrix $A$. Let $D^f = \{d_i\}_{i=1}^{N_d}$ represent the set of detections given in frame $I_f$, where $N_d$ is the number of available detections. We acquire a detection feature matrix, $F^f_D \in \mathbb{R}^{520 \times N_d}$, by accumulating the 520-dimensional feature vectors of the $N_d$ detections for each input frame $I_f$. This detection feature matrix $F^f_D$ is then made available for the data association task.

Data Association Methodology
In this section, we extend our discussion on the proposed data association framework that incorporates a deep feature association network and the Structural Similarity Index Metric along with neighborhood detection estimation to tackle the problem. The association algorithm identifies a correspondence between the object detections in the current frame and existing trajectories from the previous frames. This involves comparing the extracted features of detections with those of existing object trajectories to find the most suitable matches. Detections with the highest similarity scores are linked to their corresponding trajectories. Here, we employed a deep feature association network (deepFAN) that consists of a pre-trained CNN-based compression network and an image similarity metric, the Structural Similarity Index Metric (SSIM), to estimate the data association efficiently. This part of the MOT framework computes a feature association matrix $A$, which encodes the pairwise similarities of the detections and the pre-existing targets.

Neighborhood Detection Estimation
Generally, the data association matrix models a global relationship between all the detections in the current frame and the tracked targets from the previous frames. In the proposed method, instead of considering all the combinations, only the reliable detection-target pairs are chosen for the association task. The neighborhood detection estimation method is employed to identify those detection-target pairs. This method is based on the assumption that the objects are in smooth motion, i.e., the location of an object does not drift drastically between subsequent video frames. Therefore, we have to consider only the detections in the neighborhood area of the targets for the data association.
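Let $F^{f-1}_{TL} \in \mathbb{R}^{520 \times N_{TL}}$ represent the set of target feature vectors in the previous frame $I_{f-1}$, including the feature vectors of tracked and lost targets:

$$F^{f-1}_{TL} = \left[F^{f-1}_{T},\; F^{f-1}_{L}\right], \tag{1}$$

where $F^{f-1}_{T} \in \mathbb{R}^{520 \times N_T}$ and $F^{f-1}_{L} \in \mathbb{R}^{520 \times N_L}$ are the feature matrices that consist of the feature vectors of the tracked and lost targets from the previous frame $I_{f-1}$, and $N_T$ and $N_L$ are the numbers of active tracked and lost targets, so that $N_{TL} = N_T + N_L$.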
The neighborhood-detection-estimation algorithm relies simply on the distance between the centers of the detections and the targets. In order to find the distance, we need to define a distance metric. Here, we adopt the Euclidean distance with an additional scaling factor. Let $C_D = \{C^D_x, C^D_y\}$ and $C_{TL} = \{C^{TL}_x, C^{TL}_y\}$ be the centers of the detections and targets. The scaled Euclidean distance $E_s$ between a detection and a target with centers $(c^D_x, c^D_y)$ and $(c^t_x, c^t_y)$ is defined in Equation (2), where $(I_x, I_y)$ represents the size of the video frame.
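A minimal sketch of such a scaled distance is given below. It assumes that each coordinate difference is divided by the corresponding frame dimension before the Euclidean norm is taken; the exact scaling used in the framework may differ. The function name `scaled_distances` and its argument layout are illustrative only.

```python
import numpy as np

def scaled_distances(target_centers, det_centers, frame_size):
    """Pairwise scaled Euclidean distance between targets and detections.

    Assumption: coordinate differences are normalized by the frame
    size (I_x, I_y) before the Euclidean norm is computed.
    """
    t = np.asarray(target_centers, dtype=float)   # shape (N_TL, 2)
    d = np.asarray(det_centers, dtype=float)      # shape (N_d, 2)
    scale = np.asarray(frame_size, dtype=float)   # (I_x, I_y)
    # Broadcast to all target-detection pairs: shape (N_TL, N_d, 2)
    diff = (t[:, None, :] - d[None, :, :]) / scale
    return np.linalg.norm(diff, axis=-1)          # shape (N_TL, N_d)
```

With this normalization the distance is independent of the video resolution, so a single threshold can be shared across sequences of different sizes.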

Optical flow-based motion prediction:
From the object detection bounding boxes, we can determine the center locations of all the detections in the image frame, $C^f_D$. Further, we have the locations of the tracked and lost targets, $C^{f-1}_{TL}$, in the frame $I_{f-1}$ as feedback information from the previous target trajectories. The possible locations of these targets in the present frame $I_f$, $\hat{C}^f_{TL}$, are estimated using the optical flow motion model. Specifically, knowing the target center $\{c^t_x, c^t_y\}$ in the previous frame $I_{f-1}$, we compute its corresponding location $\hat{c}^f_t$ in the following frame $I_f$ using the Lucas-Kanade optical flow method with pyramids [52] (Equation (3)). Using optical-flow-based motion prediction, the location of a lost target is continuously updated. Consequently, if the target is occluded in one frame and reappears at a different location in subsequent frames, this motion prediction aids in estimating the likely location of the lost target. This approach improves the efficiency of reidentifying the lost target, leading to more accurate and reliable tracking performance.
Using Equation (2), we calculate the distance between each existing target and all detections $D^f$ and select only those detections within the distance threshold, $T_e$, to prioritize nearby detections. The network then encodes the feature vectors of all the possible pairings between the targets and the respective neighboring detections into a tensor, termed the feature permutation tensor $\Phi \in \mathbb{R}^{N \times N \times (520 \times 2)}$. For clarity, the dimension of the tensor $\Phi$ is described as Width × Height × Depth, where the width represents the targets and the height represents the detections. The feature vector of each target is concatenated with the feature vector of each one of its neighboring detections and arranged in $\Phi$ along its depth dimension. For each image frame in the video sequence, the number of targets and detections will vary. To maintain consistency in the tensor dimension, we introduce additional zero vectors into the tensor, ensuring that the size consistently remains at $N \times N \times 1040$. The value chosen for $N$ limits the maximum number of object detections in each frame, and through our analysis, $N = 80$ was found to be a generous bound for the MOT benchmark datasets.
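One common way to write a flow-based prediction of this kind is to displace the previous center by the flow vector estimated at that point. The form below is given as an assumption for illustration, not as the exact equation of the framework; $(u, v)$ and the operator $\mathrm{LK}$ are notation introduced here.

```latex
\hat{c}^{\,f}_{t} = c^{\,f-1}_{t} + (u, v),
\qquad
(u, v) = \mathrm{LK}\!\left(I_{f-1},\, I_{f};\; c^{\,f-1}_{t}\right)
```

Here $\mathrm{LK}(\cdot)$ denotes the displacement returned by pyramidal Lucas-Kanade tracking of the point $c^{\,f-1}_{t}$ from frame $I_{f-1}$ to frame $I_f$.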
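The construction of the zero-padded tensor can be sketched as follows. This is an illustrative NumPy version under stated assumptions: `build_phi` and `neighbor_mask` are hypothetical names, the mask is assumed to come from thresholding the scaled distances with $T_e$, and pairs outside the neighborhood are simply left as zero vectors.

```python
import numpy as np

def build_phi(target_feats, det_feats, neighbor_mask, n_max=80):
    """Build the N x N x (2 * dim) feature permutation tensor.

    target_feats:  (dim, N_TL) feature matrix of tracked and lost targets
    det_feats:     (dim, N_d) feature matrix of current detections
    neighbor_mask: (N_TL, N_d) boolean, True where the detection lies
                   in the neighborhood of the target (assumed input)
    """
    dim, n_t = target_feats.shape
    _, n_d = det_feats.shape
    if n_t > n_max or n_d > n_max:
        raise ValueError("more targets or detections than n_max")
    # Zero initialization provides the padding for unused slots.
    phi = np.zeros((n_max, n_max, 2 * dim), dtype=float)
    for i in range(n_t):          # first axis: targets
        for j in range(n_d):      # second axis: detections
            if neighbor_mask[i, j]:
                phi[i, j, :dim] = target_feats[:, i]
                phi[i, j, dim:] = det_feats[:, j]
    return phi
```

With 520-dimensional features and `n_max=80`, the result has shape `(80, 80, 1040)`, matching the tensor size stated in the text.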

Deep Feature Association Network
The objective of this component in the proposed MOT framework is to estimate the affinities between the selected detection-target pairs using the extracted feature vectors. This sub-network maps the tensor $\Phi \in \mathbb{R}^{N \times N \times 1040}$ into a feature association matrix $A_F \in \mathbb{R}^{N \times N}$. In the association matrix $A_F$, the columns account for the detections in the current frame and the rows account for the active targets, both tracked and lost, from the previous trajectory. The scalar score $A_{F(i,j)}$ indicates the confidence that the $j$-th detection and the $i$-th target are associated with the same identity. We refer to the major component of this module as the deep compression network due to its functionality. The architecture of the deep compression network is inspired by the work presented in [30]. The input to this network is the tensor $\Phi \in \mathbb{R}^{N \times N \times 1040}$, which accumulates the feature vectors of target-detection pairs. The output is an association matrix $A_F \in \mathbb{R}^{N \times N}$ that encodes the similarity scores of these pairs. The specifications of the deep compression network architecture are detailed in Table 1. This network employs a five-layer convolutional neural network with 1 × 1 kernels for the task. As the tensor $\Phi$ passes through the network, it undergoes gradual dimension reduction along the depth dimension via the 1 × 1 kernels. These convolutional kernels enable the computation of similarity scores for each object pair without interference from neighboring objects.
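The reason 1 × 1 kernels keep object pairs independent can be seen in a small NumPy sketch: a 1 × 1 convolution is a linear map applied along the depth axis at every position $(i, j)$ separately. The ReLU between layers and the layer widths used in the usage note are assumptions for illustration; the actual specification is the one in Table 1.

```python
import numpy as np

def conv1x1(x, weight, bias):
    """1x1 convolution: the same linear map over depth at each (i, j)."""
    # (N, N, C_in) @ (C_in, C_out) -> (N, N, C_out)
    return x @ weight + bias

def compress(phi, layers):
    """Apply a stack of 1x1 layers, reducing the depth down to 1.

    layers: list of (weight, bias) pairs; the last weight must have
    a single output channel. ReLU between layers is an assumption.
    """
    x = phi
    for k, (w, b) in enumerate(layers):
        x = conv1x1(x, w, b)
        if k < len(layers) - 1:
            x = np.maximum(x, 0.0)
    return x[..., 0]              # (N, N) association scores
```

For example, five layers with hypothetical widths 1040 → 512 → 256 → 128 → 64 → 1 would map an `(80, 80, 1040)` tensor to an `(80, 80)` matrix. Changing the features of one pair alters only the corresponding output entry.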
Training deep compression network: During the training process, the deep compression network learns the association function, which estimates the feature association matrix $A_F \in \mathbb{R}^{N \times N}$ from the tensor $\Phi \in \mathbb{R}^{N \times N \times 1040}$ for reliable online multiple object tracking. The approach used to train the compression network is illustrated in Figure 2. When we employ the proposed MOT framework (Figure 1) for online tracking, the feature extractor functions as a single-stream model. Additionally, during the tracking process, the input frames are presented in the order of the original video. We develop a specialized training strategy for the deep compression network, which enables the network to effectively learn the data association function by utilizing non-consecutive frame pairs from the video sequence. As a result, the network learns to reliably associate objects in a given frame with those in multiple previous frames, benefiting the framework by reducing identity switches and fragmented target trajectories. As shown in Figure 2, during training, we configured the network as a two-stream network of modified VGGNets with shared parameters. The feature extractor receives two frames, $I_f$ and $I_{f-p}$, separated by $p$ frames (i.e., not adjacent frames), as well as the centers, $C^f_D$ and $C^{f-p}_D$, of the pre-detected objects within those frames. These frame pairs are processed by the modified VGGNets, which extract a 520-dimensional feature vector for each object detection in the input frames. We obtain feature matrices, $F^f_D$ and $F^{f-p}_D$, which accumulate the feature vectors for detections in each input frame $I_f$ and $I_{f-p}$. Since the input frames are non-adjacent, neighborhood detection estimation (NDE) is not applicable and is excluded from the training pipeline. The network concatenates the columns of $F^f_D$ and $F^{f-p}_D$ along the depth dimension of the tensor $\Phi \in \mathbb{R}^{N \times N \times 1040}$ in all possible permutations. To maintain consistency in the tensor dimensions, additional zero vectors are introduced, ensuring that the size remains $N \times N \times 1040$. This tensor is then forward-passed through the compression network, which utilizes five convolutional layers with 1 × 1 kernels to map and estimate the feature association matrix $A_F \in \mathbb{R}^{N \times N}$. For computing the error of the network during the learning process, we define a loss function $J$ with the help of ground truth trajectories. A ground truth target association matrix $G \in \mathbb{R}^{N \times N}$ is constructed as a binary matrix encoding the correspondence between the objects detected in frames $I_{f-p}$ and $I_f$. If the $i$-th target in $I_{f-p}$ corresponds to the $j$-th target in $I_f$, then the entry $G^{f-p,f}_{(i,j)}$ of the matrix is non-zero; otherwise, it is zero. The ground truth target association matrix $G$ is subsequently compared with the network-predicted feature association matrix $A_F$ for the loss computation. The loss function of our training network is defined in Equation (4), where the symbol $\odot$ represents the Hadamard product. The log operation on $A_F$ is performed elementwise, and $\sum_{i,j=1:N}$ finds the sum of all elements in the Hadamard product matrix. In the loss function, instead of computing a distance metric between the predicted association matrix $A_F$ and the ground truth association matrix $G$, the probabilities encoded by the relevant coefficients of $A_F$ are maximized. During learning, the parameters of the compression network are updated by minimizing the loss over the training samples.
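The ground truth matrix and a loss of the kind described above can be sketched as follows. The loss here is the negated sum of $G \odot \log A_F$, which follows the verbal description; any normalization (for example, by the number of matched pairs) and the small `eps` guard against $\log 0$ are assumptions, and the function names are illustrative.

```python
import numpy as np

def ground_truth_matrix(ids_prev, ids_curr, n_max=80):
    """Binary matrix G: G[i, j] = 1 if object i in I_{f-p} and
    object j in I_f share the same ground truth identity."""
    g = np.zeros((n_max, n_max))
    for i, a in enumerate(ids_prev):
        for j, b in enumerate(ids_curr):
            if a == b:
                g[i, j] = 1.0
    return g

def association_loss(a_f, g, eps=1e-12):
    """Negated sum over all entries of G (Hadamard) log(A_F).

    Minimizing this value maximizes the probabilities that A_F
    assigns to the ground truth pairs; eps is an assumed guard.
    """
    return -np.sum(g * np.log(a_f + eps))
```

Only entries selected by $G$ contribute, so the padded rows and columns of the $N \times N$ matrix do not affect the loss.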
The trained compression network is employed in online multiple object tracking. Referring to Section 3.1.1, for consistency, additional zero vectors are introduced in the tensor $\Phi$, so that the size will always be $N \times N \times (520 \times 2)$. Therefore, in the association matrix $A_F \in \mathbb{R}^{N \times N}$, there are irrelevant values corresponding to the appended zero vectors. To reduce the irrelevant information and to normalize the matrix, we performed the following three post-processing steps over the feature association matrix $A_F$: (i) Truncation: Since we have only $N_d$ detections and $N_{TL}$ active targets, the matrix $A_F \in \mathbb{R}^{N \times N}$ is resized by truncating it to $N_{TL} \times N_d$. (ii) Row-wise Softmax: This operation normalizes the rows of the association matrix by fitting a separate probability distribution to each row. The output values of a row lie in the range [0, 1] and sum to 1. Thus, each row of the resulting association matrix encodes the association probability between each active target in $I_{f-1}$ and all the detections in $I_f$. (iii) Thresholding: The association matrix values indicate the similarity between the detection and target objects. For a reliable data association, the values above the threshold $T_a$ are retained, and all other values are set to zero.
These post-processing steps yield an updated feature association matrix $A_F \in \mathbb{R}^{N_{TL} \times N_d}$, which is then passed to the SSIM module for the association update.
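The three post-processing steps can be sketched in a few lines of NumPy. The function name `postprocess` is illustrative; the default `t_a=0.4` is the association threshold reported in the implementation details, and the strict comparison (`>`) is an assumption about how "above the threshold" is applied.

```python
import numpy as np

def postprocess(a_f, n_tl, n_d, t_a=0.4):
    """Truncate, row-normalize, and threshold the association matrix."""
    if n_tl == 0 or n_d == 0:
        return np.zeros((n_tl, n_d))
    # (i) Truncation: drop rows/columns that belong to zero padding.
    a = np.asarray(a_f, dtype=float)[:n_tl, :n_d]
    # (ii) Row-wise softmax (shifted by the row maximum for stability).
    e = np.exp(a - a.max(axis=1, keepdims=True))
    a = e / e.sum(axis=1, keepdims=True)
    # (iii) Thresholding: keep only sufficiently confident scores.
    return np.where(a > t_a, a, 0.0)
```

After step (ii) every row is a probability distribution over the current detections, so the threshold $T_a$ has the same meaning for every target.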

Structural Similarity Index Metric for Association Update
The ultimate aim of the data association module is to develop a robust association model that delivers the most relevant information for achieving accurate multiple object tracking (MOT) performance. In the association matrix, a non-zero association value indicates a potential match between the corresponding target-detection pair. Traditionally, the detection with the highest association score is linked to the target trajectory. However, when multiple detections have similar or nearly equivalent association scores, uncertainties arise, leading to unreliable associations between detections and targets.
To address this issue, our proposed method incorporates the Structural Similarity Index Metric (SSIM) [11][12][13]. The SSIM is a widely recognized perceptual metric that measures the similarity between two images by leveraging their structural characteristics. By integrating the SSIM, we enhance the decision-making process for target associations. The proposed MOT framework considers the association results derived from the SSIM to make the final decision on the target association. This metric evaluates the effective similarity between the target and detection pairs, thereby improving the accuracy and reliability of the associations. Using the SSIM reduces the chance of wrong associations that arise when multiple detections have nearly equal association scores, so that detections are aligned more accurately with their targets.
Let $(TL)_i$ be the $i$-th active target and $\{d_k\}_{k=1}^{K}$ be the detections corresponding to the non-zero association scores with the $i$-th target. Also, let $d_{max}$ represent the detection with the highest score and $A_{F(i,max)}$ be the highest score. As stated before, if there are other detections with scores similar or close to this highest association score $A_{F(i,max)}$, uncertainties occur in the target association. For the target $(TL)_i$, first, the set of detections with uncertainty, $D_s$, is estimated according to Equation (5). If the association matrix $A_F$ contains any zero rows, then the corresponding detection set in $D_s$ becomes an empty set. The SSIM module calculates the similarity score between the target and each detection in $D_s$. The output of the SSIM module is another association matrix, $A_S \in \mathbb{R}^{N_{TL} \times N_d}$, in which rows and columns represent the same active targets and detections as in $A_F$, but the entries hold the SSIM score of each valid pair, as given in Equation (6). The SSIM-based association matrix, $A_S$, is utilized alongside $A_F$ to establish the final track association, $A$. The track association matrix is the sum of the two matrices, $A = A_F + A_S$.
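The ambiguity handling can be illustrated with the sketch below, which rests on several assumptions. The SSIM is computed from global image statistics rather than with the usual sliding window, the ambiguous set is taken to be the non-zero scores within a hypothetical `margin` of the row maximum (and only when at least two detections qualify), and image patches are assumed to be pre-resized to a common shape. All names are illustrative.

```python
import numpy as np

def global_ssim(x, y, data_range=255.0):
    """SSIM from global statistics (a simplification of windowed SSIM)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    var_x = ((x - mu_x) ** 2).mean()
    var_y = ((y - mu_y) ** 2).mean()
    cov = ((x - mu_x) * (y - mu_y)).mean()
    return ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))

def ambiguous_set(row, margin):
    """Indices of non-zero scores within `margin` of the row maximum.

    Returns an empty list for zero rows or when only one detection
    qualifies (no ambiguity); `margin` is a hypothetical parameter.
    """
    row = np.asarray(row, dtype=float)
    if not row.any():
        return []
    top = row.max()
    cand = [k for k, s in enumerate(row) if s > 0 and top - s <= margin]
    return cand if len(cand) > 1 else []

def final_association(a_f, target_patches, det_patches, margin=0.05):
    """Return A = A_F + A_S, with A_S filled only for ambiguous pairs."""
    a_f = np.asarray(a_f, dtype=float)
    a_s = np.zeros_like(a_f)
    for i, row in enumerate(a_f):
        for k in ambiguous_set(row, margin):
            a_s[i, k] = global_ssim(target_patches[i], det_patches[k])
    return a_f + a_s
```

When two detections have nearly equal deepFAN scores for a target, the one whose patch is structurally closer to the target patch receives the larger combined score.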

Track Association
In a multiple object tracking scenario, an object detected in a video sequence undergoes different state transitions. When the object detector detects the object for the first time, a new track is initialized in the trajectory list. The object is now in the tracked state and remains in that state as long as it is re-detected in the subsequent frames. When the object gets occluded or goes out of the camera's field of view, the object is transferred to the lost state. If the lost target re-appears, then the state is updated to tracked, and the tracking process resumes. The trajectory of the lost target is terminated if it stays in the lost state for too long. The data association algorithm in MOT helps to find the state of each detection in the video sequence. It estimates the correspondence between the object detections in the current frame and existing targets.
After training the deep compression network with MOT datasets, we employed the trained network in the proposed MOT framework. Algorithm 1 summarizes the online tracking process in the proposed method. The objective of the MOT problem is to find the trajectory of all the possible targets present in the given input image sequence. Here, the MOT framework expects the present image frame $I_f$ and the object detection centers $C^f_D$ as its inputs. The detection feature matrix $F^f_D$ computed by the VGGNet feature extractor, along with the target feature matrix $F^{f-1}_{TL}$, is utilized to create the feature permutation tensor $\Phi$ by a concatenation operation. We stored the feature vectors of the active targets, both tracked and lost, from the previous frame to find the association in the current frame. The tensor $\Phi$, forward-passed through the compression network, is mapped to the association matrix $A_F$ as described in Section 3.1.2. Along with $A_F$, the SSIM-based association matrix $A_S$ is also utilized for finding the final track association, $A$. The track association method adopted in our framework is performed as follows.
Algorithm 1. Online multiple object tracking.

In the first frame $I_1$, we initialize the trajectory list $T$ with tracks $\{\tau_i\}_{i=1}^{N_d}$ by considering all the detections present in it as new tracked targets. Here, a track $\tau_i$ is an ordered set of the states of the $i$-th target in the video sequence.
In Equation (7), $f_e$ and $f_t$ are the entry and termination frames of the $i$-th target, $(c_x, c_y)$ is the center of the target, and $(w, h)$ are the width and height of the target. For each new target entry, the track is initialized with $\tau_i = \{s^{f_e}_i\}$. The trajectory list is updated after each input frame by employing the Hungarian algorithm [53] on the final association matrix $A$. In the track association part, the targets in the tracked state get higher priority. In this process, the targets that are associated with detections are labeled as tracked, and the targets without an association are labeled as lost. If a target stays in the lost state for a long time (longer than $N_{inact}$ frames; here, we chose $N_{inact} = 20$), the object is considered to have entered an inactive state, and we terminate the trajectory corresponding to that object. Finally, we initialize new trajectories for the detections that are not associated with the tracked targets.
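The state transitions described above can be sketched as a small bookkeeping routine. This is an illustrative outline, not the listing of Algorithm 1: the `Track` class, the `update_tracks` function, and the `matches` dictionary (assumed to be the output of an assignment step such as the Hungarian algorithm on $A$) are hypothetical names, and the strict comparison with `n_inact` is an assumption.

```python
from dataclasses import dataclass, field

@dataclass
class Track:
    track_id: int
    states: list = field(default_factory=list)  # (frame, cx, cy, w, h)
    status: str = "tracked"                     # tracked | lost | inactive
    lost_frames: int = 0

def update_tracks(tracks, detections, matches, frame, next_id, n_inact=20):
    """Update the trajectory list for one frame.

    detections: list of (cx, cy, w, h) tuples for the current frame
    matches:    dict mapping track_id -> detection index (assumed to
                come from the assignment step on the matrix A)
    Returns the next unused track id.
    """
    used = set()
    for tr in tracks:
        if tr.status == "inactive":
            continue                      # terminated trajectories stay closed
        j = matches.get(tr.track_id)
        if j is not None:                 # associated: tracked state
            tr.states.append((frame, *detections[j]))
            tr.status, tr.lost_frames = "tracked", 0
            used.add(j)
        else:                             # no association: lost, then inactive
            tr.lost_frames += 1
            tr.status = "inactive" if tr.lost_frames > n_inact else "lost"
    for j, det in enumerate(detections):  # unmatched detections start new tracks
        if j not in used:
            tracks.append(Track(next_id, [(frame, *det)]))
            next_id += 1
    return next_id
```

Calling the function on the first frame with an empty `matches` dictionary initializes one track per detection, which mirrors the initialization step for $I_1$.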

Experiment Results and Discussion
In this section, we experimentally demonstrate the performance of the proposed deep MOT framework on the popular MOT benchmark datasets using the standard metrics. We first present the implementation details of our MOT framework, followed by the benchmark datasets and metrics used for performance analysis. We then conducted an ablation study on the validation dataset to better understand the behavior of our approach. Further, to provide an authoritative reference, the proposed framework was evaluated on the test datasets, and the results were compared with the state-of-the-art methods.
MOT benchmark datasets: Three popular MOT datasets from the MOT Challenge [48,49], namely MOT15, MOT17, and MOT20, together with UA-DETRAC [50,51], were employed here to test the performance of the proposed approach. These are the centralized benchmark datasets used to evaluate tracking techniques in online multiple-object-tracking challenges. The annotated training video sequences, which include the object detections and the ground truth labels in each frame, are used to train the models. The video sequences in the test datasets provide only object detections, whereas the ground truth labels are withheld. Once a new MOT tracker has been submitted for performance analysis, the online MOT challenge hosting server evaluates the tracking results based on the standard MOT metrics [54].

Implementation Details
The proposed MOT framework was implemented in Python, and the training was conducted on an NVIDIA GeForce Titan Xp 12 GB GPU. We trained the deep compression network on the MOT17 training dataset using the SGD optimizer. The hyperparameter values finally used in the training process were as follows: a batch size of 8, a momentum of 0.9, an initial learning rate of 0.01, a weight decay of 0.0001, and 120 epochs per model.
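To make the role of these hyperparameters concrete, the following sketch shows one SGD update with momentum and weight decay in plain Python. It is only an illustration: the text does not state which deep learning library or update convention was used, so the form below (weight decay added to the gradient, followed by a momentum buffer update, as in `torch.optim.SGD`) is an assumption, and `sgd_momentum_step` is a hypothetical helper name.

```python
def sgd_momentum_step(weights, grads, velocity,
                      lr=0.01, momentum=0.9, weight_decay=1e-4):
    """Return updated (weights, velocity) after one SGD step.

    Assumed convention (not confirmed by the text):
        g = grad + weight_decay * w
        v = momentum * v + g
        w = w - lr * v
    """
    decayed = [g + weight_decay * w for g, w in zip(grads, weights)]
    new_velocity = [momentum * v + g for v, g in zip(velocity, decayed)]
    new_weights = [w - lr * v for w, v in zip(weights, new_velocity)]
    return new_weights, new_velocity


# Two consecutive steps on a single scalar weight with a constant gradient.
w, v = [1.0], [0.0]
w, v = sgd_momentum_step(w, [0.5], v)
w, v = sgd_momentum_step(w, [0.5], v)
```

With a constant gradient, the momentum term makes the second step larger than the first, which is the intended accelerating effect of the 0.9 momentum setting.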
In the proposed MOT framework, the state transition of a target from lost to inactive is governed by the hyperparameter $N_{inact}$, the maximum number of frames a target may stay in the lost state before being moved to the inactive state. In our analysis, we set $N_{inact}$ to 20. We chose $N = 80$ as the maximum number of object detections in each frame, which is a generous bound for the MOT benchmark datasets. The feature extractor network has an input frame size of 900 × 900; therefore, all training and testing frames are first resized to these dimensions before being passed through the network. The two threshold parameters used in the proposed framework are the distance threshold $T_e$ and the association threshold $T_a$. The evaluation metrics were optimal for $T_e = 0.35$. The association threshold $T_a$ is used in the thresholding step, which is applied as post-processing to the feature association matrix $A_F$; the proposed MOT framework obtained its optimum performance for $T_a = 0.4$. The selection of $T_e$ and $T_a$ is explained in Section 4.2.
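As an illustration of the thresholding step, the sketch below suppresses weak entries of an association matrix. It assumes that scores below $T_a$ are set to zero so that they cannot be selected during assignment; the text does not spell out the exact rule, and `threshold_association` is a hypothetical helper name.

```python
T_A = 0.4  # association threshold reported in the text


def threshold_association(a_f, t_a=T_A):
    """Zero out association scores below t_a (assumed post-processing rule).

    a_f is a list of rows, one row per target and one column per detection.
    """
    return [[score if score >= t_a else 0.0 for score in row] for row in a_f]


# Two targets, two detections: only confident pairs survive.
a_f = [[0.9, 0.3],
       [0.4, 0.05]]
filtered = threshold_association(a_f)
```

Under this assumption, a low-scoring pair can no longer be chosen merely because it is the best remaining option for a target.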

Ablation Study
In this section, to gain deeper insight into the proposed MOT framework, we experimentally evaluated the contribution of the different tracking components. Since the ground truth annotations are not provided for the MOT test datasets, the ablation study was conducted on the MOT15, MOT17, and MOT20 training datasets. We split the MOT training dataset into training and validation datasets; the split is presented in Table 2. The proposed framework was trained on the training sequences, and the performance was evaluated on the validation sequences, as provided in Table 2. This section presents a detailed analysis and discussion of the results obtained for the three main components: (i) neighborhood detection estimation, (ii) feature association network, and (iii) SSIM association update. To investigate the significance of each component, we conducted several experiments by disabling one component at a time and studying the effect on the MOT metrics. Table 3 consolidates the evaluation results of the variants of the proposed method on all MOT evaluation metrics, demonstrating the significance of each module in the framework:

(i) Neighborhood detection estimation:
As discussed earlier, NDE limits the search space for the association of a particular target to its neighborhood, assuming that the target will not move drastically from its position between consecutive frames. The neighborhood of the target object is bounded using a distance threshold $T_e$. Figure 3 shows the MOTA and IDF1 scores for different values of the distance threshold $T_e$. The evaluation metrics were optimal for $T_e = 0.35$, and we used this value of $T_e$ in the further experiments.
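The gating idea behind NDE can be sketched as follows. This is a simplified illustration and not the authors' implementation: it assumes that object centers are normalized to the unit square and that the neighborhood is defined by the Euclidean distance between centers; `neighborhood_pairs` is a hypothetical helper name.

```python
import math

T_E = 0.35  # distance threshold reported in the text


def neighborhood_pairs(target_centers, detection_centers, t_e=T_E):
    """Return the (target index, detection index) pairs within distance t_e.

    Centers are (cx, cy) tuples, assumed to be normalized to [0, 1].
    """
    pairs = []
    for i, (tx, ty) in enumerate(target_centers):
        for j, (dx, dy) in enumerate(detection_centers):
            if math.hypot(tx - dx, ty - dy) <= t_e:
                pairs.append((i, j))
    return pairs


targets = [(0.2, 0.2), (0.8, 0.8)]
detections = [(0.25, 0.22), (0.9, 0.75), (0.5, 0.5)]
reliable = neighborhood_pairs(targets, detections)
```

Only the pairs returned by such a gate would need to be scored by the association network, which is how NDE reduces both mismatches and computation.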
To demonstrate the significance of the proposed NDE in the MOT framework, we compared the performance of the trackers with and without NDE. Figure 4 shows three essential MOT metrics, MOTA, MOTP, and IDF1, evaluated on both the MOT17 and MOT20 validation datasets. Table 3 also tabulates the experimental results on all MOT metrics evaluated on the MOT17 and MOT20 validation datasets. The MOTA metric measures the overall accuracy of the detection and tracking, whereas the IDF1 score depends strongly on the association accuracy. The MOTP depends on the detection output. It is evident from the MOT scores that the performance improved with NDE. The MOTA is derived from three types of detection-association errors: false positives, false negatives (missed targets), and identity switches. Since the NDE employed in the proposed method helps to choose only the reliable pairs for the association, it reduces the chance of wrong associations during the track estimation. The results show that NDE reduces wrong associations, thereby reducing the identity switches, fragmentation, false negatives, and false positives, which in effect improves the MOTA. The improvement in the IDF1 score also confirms that the association accuracy improves with NDE.
(ii) Deep feature association network:
The deep feature association network (deepFAN) estimates the association matrix that encodes the association scores of each detection-target pair. The module includes three post-processing steps that remove irrelevant information from the association matrix, improving the trajectory estimation. In the thresholding step, we used a hyperparameter, the threshold $T_a$. Figure 5 plots the MOTA and IDF1 scores of the proposed MOT framework for different values of $T_a$; the optimum result was obtained for $T_a = 0.4$.
Figure 6 and Table 3 show the performance of the proposed training strategy on the MOT17 and MOT20 validation sequences in terms of the MOT metrics. The deep network was trained on the MOT dataset with the strategy that the input frames need not be sequential, i.e., non-consecutive input frames are used. Therefore, the data association model becomes robust to tracking challenges such as appearance variation, illumination changes, and scale changes. It also helps in the re-identification of lost targets and in handling object occlusions, thereby reducing the identity switches and fragmentation issues in MOT. The experimental results showed that it improves the overall MOT performance.
(iii) SSIM association update:
The SSIM introduced in the proposed model can be considered a second opinion when an ambiguity in association occurs. Figure 7 and Table 3 show the importance of the SSIM association by evaluating the model on the MOT metrics with the MOT17 and MOT20 validation data sequences. As the performance of the data association algorithm improves, we obtain better association estimates, which enhances the tracking performance. The results show that the SSIM enhances the performance of the data association algorithm. It reduces the false negatives and identity switches and, hence, improves the MOTA. The high IDF1 score also validates the significance of the SSIM association in refining the association matrix.
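To illustrate how SSIM can act as a second opinion, the sketch below computes a global (single-window) SSIM between two grayscale patches and picks the most similar candidate. It is a simplification under stated assumptions: the standard SSIM is computed over local windows and averaged, the patches are assumed to be flattened lists of equal length with 8-bit intensities, and the helper names `global_ssim` and `resolve_ambiguity` are hypothetical rather than taken from the paper.

```python
def global_ssim(x, y, dynamic_range=255.0):
    """Single-window SSIM between two equally sized grayscale patches.

    x and y are flat lists of pixel intensities. The constants follow the
    usual SSIM choice C1 = (0.01 L)^2 and C2 = (0.03 L)^2.
    """
    n = len(x)
    if n == 0 or n != len(y):
        raise ValueError("patches must be non-empty and of equal size")
    mu_x, mu_y = sum(x) / n, sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    c1 = (0.01 * dynamic_range) ** 2
    c2 = (0.03 * dynamic_range) ** 2
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


def resolve_ambiguity(target_patch, candidate_patches):
    """Return (index of the most similar candidate, list of SSIM scores)."""
    scores = [global_ssim(target_patch, c) for c in candidate_patches]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


target = [10, 200, 10, 200]
candidates = [[200, 10, 200, 10], [12, 198, 11, 205]]
best_index, ssim_scores = resolve_ambiguity(target, candidates)
```

In a tracker, the patches would be image crops of the target and of the competing detections, resized to a common shape before comparison.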

MOT Benchmark Evaluation
This section presents the experimental evaluation of the proposed method on the benchmark datasets. Table 4 summarizes and compares our results with state-of-the-art algorithms on the MOT benchmark datasets, and Table 5 does the same for UA-DETRAC. Here, we also show the effects of systematically adding neighborhood detection estimation, non-sequential training, and the SSIM association update to the proposed tracker.
The benchmark evaluation results show that the proposed MOT framework performs very well in terms of the MOT evaluation metrics. We would like to emphasize that the numbers of identity switches, fragmentations, and false negatives are reduced, indicating a reduction in wrong associations among detection-target pairs. This results in better accuracy (MOTA). The IDF1 score is also improved, which is a clear indication of improved association accuracy. This shows the robustness and efficiency of the proposed data association method.
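For reference, the two headline metrics discussed above can be computed from error counts as in the following sketch. The formulas are the standard CLEAR MOT definition of MOTA and the identity-based F1 score; the function names and the example counts are hypothetical and are not results from this work.

```python
def mota(false_negatives, false_positives, id_switches, num_gt):
    """MOTA = 1 - (FN + FP + IDSW) / GT, with counts summed over all frames."""
    if num_gt <= 0:
        raise ValueError("num_gt must be positive")
    return 1.0 - (false_negatives + false_positives + id_switches) / num_gt


def idf1(idtp, idfp, idfn):
    """IDF1 = 2 IDTP / (2 IDTP + IDFP + IDFN)."""
    denominator = 2 * idtp + idfp + idfn
    return 2 * idtp / denominator if denominator else 0.0


# Hypothetical counts, chosen only to illustrate the arithmetic.
example_mota = mota(false_negatives=100, false_positives=50, id_switches=10, num_gt=1000)
example_idf1 = idf1(idtp=80, idfp=10, idfn=30)
```

Because identity switches, false negatives, and false positives all appear in the numerator of the MOTA error term, reducing wrong associations raises the MOTA directly, while IDF1 rewards trajectories that keep a consistent identity.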
We compared our results with recent state-of-the-art methods. The benchmark evaluation results show that the proposed data association method outperforms the state-of-the-art DAN model [30]. In particular, the neighborhood detection estimation employed for detection-target feature pair selection reduces the association mismatch and improves the computational efficiency. The post-processing steps after deepFAN also help to enhance the association accuracy and reduce the computational complexity, and the SSIM reduces the ambiguity in the data association. The tracking results of the proposed tracker on the UA-DETRAC dataset are summarized in Table 5. Here, we opted for the EB detector [63] for a fair comparison. Since the trackers in Table 5 used different detectors, the name of each tracker is given along with the detector used. The proposed method gives better results on the UA-DETRAC evaluation than the other approaches and can also be used effectively for vehicle tracking.

Conclusions
Developing a better data association framework is crucial for robust multiple object tracking. This research work makes two important contributions to enhance the data association. The first is neighborhood detection estimation (NDE), which retains only reliable detection-target pairs. The second is the SSIM association component, which rejects ambiguous associations with high or near-high association scores. A comprehensive evaluation strategy was adopted to study the impact of our technical contributions on popular multiple object tracking benchmarks. Further, we carried out a systematic ablation study to pinpoint the benefits of each proposal, and we compared our proposals with recent multiple object tracking frameworks. Our studies found that the proposed tracker gave very few identity switches, which is one of the crucial factors in ranking trackers. The proposed tracker also achieved very high overall MOTA and IDF1 scores. Another factor that we wish to highlight is that the proposed framework rejects ambiguous associations and employs only the neighboring detections for data association. Ultimately, this leads to a higher tracking speed, which is another important factor in multiple object tracking. In the future, we would like to deploy this tracker in real-time tracking scenarios by integrating a dedicated object-detection module with the proposed tracker for real-world applications.

Figure 1. Representation of the proposed MOT framework: Inputs to the framework are the current frame $I_f$ and the centers of the object detections $C_D^f$; the output is the estimated trajectories of all the targets for frames up to $I_f$. The two main proposals in the framework are neighborhood detection estimation (NDE) and the data association framework with deepFAN and SSIM. The detection feature matrix $F_D^f$ obtained from the deep feature extractor and the matrix of existing target feature vectors $F_{TL}^{f-1}$ are given to NDE to find reliable detection-target pairs and encode them in the 3D tensor $\Phi$. The input to the data association network is $\Phi$, and the output is the association matrix $A \in \mathbb{R}^{N_{TL} \times N_d}$; the scalar score $A_{(i,j)}$ represents the association score between the $j$-th detection and the $i$-th target. The trajectory list $T$ is updated using the association matrix $A$ for $I_f$.

Figure 2. Approach to train the deep feature association network (deepFAN): Although the feature extractor is deployed as a single-stream model in the online MOT framework, during training it is treated as a two-stream network with shared parameters. The inputs to the feature extractor are two frames, $I_f$ and $I_{f-p}$, which are $p$ frames apart (meaning they need not be adjacent), together with the centers of the object detections ($C^f$ and $C^{f-p}$ for $I_f$ and $I_{f-p}$); the goal is to find the association between the detections in the two frames. Since the input frames are non-adjacent, neighborhood detection estimation (NDE) is not valid and is not applied in the training pipeline. With the supervision of the ground truth $G_{f-p,f}$, the cost function $J$ is computed, and the weights of the deep compression network in deepFAN are updated.

Algorithm 1 (continued).
for each active target $(TL)_i$, $i = 1 : N_{TL}$, do: find the detections $D_i^s$ with uncertainty.
Input: $(TL)_i$ and $D_i^s$. Output: $A_s$.
Input: $A_F$ and $A_s$. Output: final track association matrix, $A = A_F + A_s$.

Table 2. Training and validation data sequences for the ablation study of the proposed MOT framework on the MOT17 and MOT20 benchmarks.

Figure 3. Performance analysis of the proposed MOT framework with different values of the NDE distance threshold $T_e$. The MOTA and IDF1 scores for different values of $T_e$ on the MOT17 validation dataset are evaluated to find the optimum value of $T_e$.

Figure 4.

Figure 5. Performance analysis of the proposed MOT framework with different values of the threshold $T_a$. The MOTA and IDF1 scores for different values of $T_a$ on the MOT17 validation dataset are evaluated to find the optimum value of $T_a$.

Figure 6. Analysis of the deep feature association network on the MOT17 and MOT20 validation sequences.

Figure 7. Analysis of the Structural Similarity Index Metric on the MOT17 and MOT20 validation sequences.

Table 1. Architectural details of the deep compression network. Here, we use stride = 1 and ReLU activation in each layer. BN indicates batch normalization, and Y/N denotes whether BN is applied or not.

Table 3. Analysis of the proposed framework on the MOT validation datasets and comparison of the proposed tracker variants obtained by disabling different components. (The best values are in boldface. ↑ indicates that a higher value is better, and ↓ indicates that a lower value is better.)

Table 4. Comparison of the proposed MOT framework on the MOT test dataset with state-of-the-art trackers. (Red for the best values and blue for second place. NA represents values that are not available in the publications. ↑ indicates that a higher value is better, and ↓ indicates that a lower value is better.)

Table 5. Comparison of the proposed MOT framework on the UA-DETRAC test dataset with state-of-the-art trackers. (Red for the best values and blue for second place. NA represents values that are not available in the publications.)